Question 1:


In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from scipy import optimize
from scipy import spatial

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

sns.set(rc={"figure.figsize": (15, 6)})
sns.set_palette(sns.color_palette("Set2", 10))

In [2]:
lalonde_data = pd.read_csv('lalonde.csv')

Motivations

The problem we try to solve in Question 1 is evaluating the average causal effect of the "treatment", here represented by the job training program.

A naive analysis would only compare the difference in means between the two groups (with and without treatment). However, this difference reflects both the average causal effect (ACE) and the selection bias (SB). The latter can drastically change the two averages we are comparing and lead to a wrong conclusion.

In order to minimize the role of the selection bias, we use the propensity score matching (PSM) technique. The idea is to compare the difference in means between subsets of the two groups that are similar.
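
To make this precise, the naive difference in means can be decomposed (a standard identity, recalled here for reference; $Y_1$ and $Y_0$ denote the potential earnings with and without treatment, and $T$ the treatment indicator):

$$
\underbrace{E[Y \mid T=1] - E[Y \mid T=0]}_{\text{naive difference}}
= \underbrace{E[Y_1 - Y_0 \mid T=1]}_{\text{average causal effect on the treated}}
+ \underbrace{E[Y_0 \mid T=1] - E[Y_0 \mid T=0]}_{\text{selection bias}}
$$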

1.1) A naive analysis

Comparing the distributions of re78 in the two groups using boxplots


In [3]:
#Function that plots a boxplot for re78
def compare_groups(data):
    plt.figure(figsize=(10,10))
    sns.boxplot(x='treat', y='re78', data=data, showfliers=False, showmeans=True, meanline=True, meanprops=dict(color='r'))
    plt.xticks(range(2), ["Control Group", "Treatment Group"])
    plt.show()

In [4]:
compare_groups(lalonde_data)



In [5]:
#We keep track of the ratio (Treatment group real earnings in 1978 mean) / (Control group real earnings in 1978 mean) after each improvement done in exercise 1
means_ratio_over_improvement = []

#A function that prints the mean of real earnings in 1978 in both groups
def print_means(data):
    data_means = data.groupby("treat").agg(np.mean)
    print("Control group real earnings in 1978 mean: {:.0f}".format(data_means["re78"].loc[0]))
    print("Treatment group real earnings in 1978 mean: {:.0f}".format(data_means["re78"].loc[1]))
    ratio = data_means["re78"].loc[1]/data_means["re78"].loc[0]
    means_ratio_over_improvement.append(ratio)
    print("Ratio (treatment/control): {:.2f}".format(ratio))

In [6]:
print_means(lalonde_data)


Control group real earnings in 1978 mean: 6984
Treatment group real earnings in 1978 mean: 6349
Ratio (treatment/control): 0.91

A naive analysis would claim that there is no clear difference between the two groups and would thus conclude that the "Job Training Program" (JTP) is useless. Worse, since the treatment group's mean earnings are about 10% lower, such an analysis would even suggest that the treatment is worse than no treatment at all.

1.2) A closer look at the data


In [7]:
#Features of each group
main_variables = ['black', 'hispan', 'age', 'married', 'nodegree', 'educ']

#Function that displays a bar plot of each group for every features
def display_proportions(data, variables=main_variables, n_cols=3):
    N = len(variables)
    f, axes = plt.subplots(nrows=int(np.ceil(N / n_cols)), ncols=n_cols)
    f.set_figheight(10)
    for idx, axis, var in zip(range(N), axes.flatten(), variables):
        sns.barplot(x='treat', y=var, data=data, ax=axis)
        axis.set_xticklabels(["Control Group", "Treatment Group"])
        axis.set_xlabel("")
        axis.set_title(idx+1)
        axis.set_ylabel("mean of {}".format(var))

In [8]:
display_proportions(lalonde_data)


Observations

1: As we can see in the barplots above, the proportion of black people in the treatment group is 4 times as high as in the control group

2: The proportion of Hispanic people in the control group is more than twice as high as in the treatment group

3: The treatment group is on average 2 years younger than the control group

4: People in the control group are more than twice as likely to be married as the ones in the treatment group

5: The proportion of people without a degree in the treatment group is about 20% higher than in the control group

6: The mean and the variance of the number of years of education are more or less the same in both groups

With these 6 observations, we can say that the two groups are not similarly distributed and that, for this reason, it is dangerous to draw a conclusion from a superficial analysis.

Let's see whether each group has a similar number of samples:


In [9]:
lalonde_count = lalonde_data.groupby("treat").agg("count")
print("Number of people in the control group: {}".format(lalonde_count["re78"].loc[0]))
print("Number of people in the treatment group: {}".format(lalonde_count["re78"].loc[1]))


Number of people in the control group: 429
Number of people in the treatment group: 185

As we can see, there are about 2.3 times as many samples in the control group. Because of this, we can be picky and select only the part of the control group that corresponds best to the treatment group. To do so, we will match pairs of samples from the two groups according to their propensity scores and then only keep and compare the matched samples.

1.3) A propensity score model

Let's calculate the propensity score


In [10]:
from sklearn.linear_model import LogisticRegression

In [11]:
lr = LogisticRegression()
#Select features, that is drop id and treat columns
selectedFeatures = lalonde_data.drop(['id','treat'], axis=1)
#Fit the model
lr.fit(selectedFeatures, lalonde_data['treat']);

In [12]:
#Calculate the propensity scores
propensity_scores = lr.predict_proba(selectedFeatures)

In [13]:
#Only keep the probability of receiving the treatment and store it inside the dataframe
lalonde_data['propensity score'] = [x[1] for x in propensity_scores]
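
As a quick sanity check (a small sketch added here, not part of the original analysis), we can look at how the propensity scores are distributed in each group; a reasonable overlap between the two distributions is what makes matching possible.

In [ ]:
#Sanity-check sketch: compare the propensity score distributions of the two groups
for label, name in [(0, "Control group"), (1, "Treatment group")]:
    sns.kdeplot(lalonde_data.loc[lalonde_data['treat'] == label, 'propensity score'], label=name)
plt.xlabel("propensity score")
plt.legend()
plt.show()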

1.4) Balancing the dataset via matching


In [14]:
#One dataframe per group
control_group = lalonde_data[lalonde_data['treat'] == 0]
treatment_group = lalonde_data[lalonde_data['treat'] == 1]

In [15]:
#Compute the distance matrix using the absolute difference of the propensity scores
cost_matrix = spatial.distance.cdist(
    treatment_group["propensity score"].values.reshape((treatment_group.shape[0], 1)),
    control_group["propensity score"].values.reshape((control_group.shape[0], 1)),
    metric=lambda a,b: np.abs(a - b)
)
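
Note that passing a Python lambda to cdist is convenient but slow for larger groups; as a side note, the same matrix can be obtained with a broadcasted NumPy expression (an equivalent sketch, not required for the rest of the notebook):

In [ ]:
#Equivalent vectorized computation of the same cost matrix
cost_matrix_alt = np.abs(
    treatment_group["propensity score"].values[:, None]
    - control_group["propensity score"].values[None, :]
)
assert np.allclose(cost_matrix, cost_matrix_alt)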

In [16]:
#Solve the assignment problem that minimizes the total cost, i.e. the sum of the pairwise distances,
#and get the indices of the matched pairs that achieve this minimum
treatment_ind, control_ind = optimize.linear_sum_assignment(cost_matrix)
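
To illustrate what linear_sum_assignment does, here is a tiny toy example (illustrative only, unrelated to the Lalonde data):

In [ ]:
#Toy example: the optimal assignment pairs (0,0) and (1,1) for a total cost of 0.3,
#instead of the alternative pairing (0,1) and (1,0) with a total cost of 0.7
toy_cost = np.array([[0.1, 0.4],
                     [0.3, 0.2]])
toy_rows, toy_cols = optimize.linear_sum_assignment(toy_cost)
print(toy_rows, toy_cols)  # [0 1] [0 1]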

In [17]:
#We construct a dataframe with the rows corresponding to the indices obtained above. Note that we have the same number of samples in each group by construction
lalonde_ps_matched = pd.concat((treatment_group.iloc[treatment_ind], control_group.iloc[control_ind]))
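
As a quick check (a small added sketch), both groups in the matched dataframe should now contain exactly as many subjects as the treatment group (185):

In [ ]:
#Both groups should now have the same number of subjects
print(lalonde_ps_matched.groupby('treat').size())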

Now, let's compare the distribution of each feature in the two groups, as done earlier in part 1.2.


In [18]:
display_proportions(lalonde_ps_matched)


Observations

1: The difference in the proportion of black people shrank; however, the treatment group's rate is still almost twice that of the control group (better than before)

2: The proportion of Hispanic people in the control group is now twice as high as in the treatment group (better than before)

3: The control group is on average 2 years younger than the treatment group (same as before, but reversed)

4: People in the control group now have almost the same probability of being married as the ones in the treatment group (better than before)

5: The proportion of people without a degree in the treatment group is about 5% higher than in the control group (down from 20% before)

6: The mean and the variance of the number of years of education are again more or less the same in both groups

Compared to before the matching, the features are more balanced. The only features that are still not roughly the same are the ones that carry racial information.


In [19]:
compare_groups(lalonde_ps_matched)



In [20]:
print_means(lalonde_ps_matched)


Control group real earnings in 1978 mean: 5829
Treatment group real earnings in 1978 mean: 6349
Ratio (treatment/control): 1.09

We can now see that the mean in the treatment group is slightly higher than in the control group, whereas it was slightly below before. Also, the maximum, median and quartiles of the treatment group are all larger than their counterparts in the control group. This is a completely different picture from the one we had before, but let's improve it even more.

1.5) Balancing the groups further

The main remaining difference between the two groups resides in the proportions of Hispanic and black people.

For this reason, we will add the condition that two matched subjects must have the same value for the hispan feature. Doing the same for the black feature is not possible, because 156 out of the 185 people in the treatment group are black, whereas the control group only contains 87 black people out of 429.


In [21]:
additional_feature_matched = 'hispan'

#Compute the distance matrix where an entry is 0 if the row and the column have the same value for hispan, 1 otherwise
add_cost_matrix = spatial.distance.cdist(
    treatment_group[additional_feature_matched].values.reshape((treatment_group.shape[0], 1)),
    control_group[additional_feature_matched].values.reshape((control_group.shape[0], 1)),
    metric=lambda a,b: int(a != b)
)

In [22]:
#Solve the assignment problem on the combined cost matrix (propensity score distances plus the hispan distance matrix)
#to minimize the total cost, i.e. the sum of the distances,
#and get the indices of the matched pairs that achieve this minimum
treatment_ind_2, control_ind_2 = optimize.linear_sum_assignment(cost_matrix + add_cost_matrix)
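
Before moving on, we can check how well the additional constraint is respected (a small sanity-check sketch; ideally no matched pair disagrees on hispan, or only very few if a perfect assignment is impossible):

In [ ]:
#Count the matched pairs that disagree on the hispan feature
hispan_mismatches = (
    treatment_group['hispan'].values[treatment_ind_2]
    != control_group['hispan'].values[control_ind_2]
).sum()
print("Matched pairs with different 'hispan' values:", hispan_mismatches)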

1.6) A less naive analysis


In [23]:
#We construct a dataframe with the rows corresponding to the indices obtained above. Note that we have the same number of samples in each group by construction
lalonde_ps_matched_2 = pd.concat((treatment_group.iloc[treatment_ind_2], control_group.iloc[control_ind_2]))

In [24]:
display_proportions(lalonde_ps_matched_2)


Observations

The proportion of Hispanic people in the two groups is now the same, and the only feature that remains unbalanced is the proportion of black people.


In [25]:
compare_groups(lalonde_ps_matched_2)



In [26]:
print_means(lalonde_ps_matched_2)


Control group real earnings in 1978 mean: 5730
Treatment group real earnings in 1978 mean: 6349
Ratio (treatment/control): 1.11

The difference in earnings we observed in part 1.4 increased slightly, but not significantly.

Based on this difference, we could say that the "Job Training Program" (JTP) is slightly useful and has, on average, a positive effect on the earnings of the people who took the program. However, we still have a selection bias because there are far more black people in the treatment group, hence any conclusion drawn from these data remains biased. Shrinking the groups so that we only match hispan with hispan and black with black would result in such a small set that it would not be possible to draw any conclusion.

However, it is worth pointing out how far we are from the naive analysis of part 1.1. There, the mean real earnings in 1978 of the treatment group were about 10% lower than those of the control group. After refining the analysis with propensity score matching, and then additionally matching hispan people with hispan people only, the mean real earnings in 1978 of the treatment group are about 10% higher than those of the control group. This example shows how a naive analysis can lead to the wrong conclusion: we go from "the treatment is harmful" to "the treatment is beneficial".

Below is a barplot summarizing the ratio of the means after each improvement:


In [27]:
#Plot the means we recorded after each improvement
sns.barplot(y=means_ratio_over_improvement, x = ["Naive", "Propensity score", "Propensity score + hispan matching"])


Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f95a4a2a978>

Question 2


In [28]:
from sklearn import metrics
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

from time import time

2.1) Loading, TF-IDF and Splitting

Data fetching


In [29]:
#Loading data
all_news = fetch_20newsgroups(subset='all')
vectorizer = TfidfVectorizer(stop_words='english', max_df=0.5, sublinear_tf=True)

In [30]:
#Vectorizing
news_data = vectorizer.fit_transform(all_news.data)
news_target = all_news.target
news_target_names = all_news.target_names 

feature_names = vectorizer.get_feature_names()
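
As a reminder of what the vectorizer produces, here is a tiny toy illustration (toy corpus, unrelated to the news data):

In [ ]:
#Toy illustration of TfidfVectorizer: one sparse tf-idf row per document
toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_tfidf = toy_vectorizer.fit_transform(["the cat sat", "the dog sat", "the dog barked"])
print(toy_vectorizer.get_feature_names())  # ['barked', 'cat', 'dog', 'sat']
print(toy_tfidf.shape)                     # (3, 4)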

Utility functions

For the following parts of the exercise, we created some utility functions that we use here and that could be reused for other tasks.


In [31]:
# this could have been done in a simpler way for this homework,
# but it might be useful to have such a powerful function for other uses,
# hence we decide to keep it here so that others could use it too :)

def split(X, y, ratios):
    """
    Split X and y given some ratios
    
    Parameters
    ----------
    X : ndarray
    feature matrix
    
    y : ndarray
    label array
    
    ratios : list(float)
    ratios describing how to split X and y; their sum must be strictly less than 1,
    and the remaining fraction forms the last split

    Returns
    -------
    out : tuple(ndarray)
    Outputs one tuple containing first the splits of X, then the splits of y
    """
    assert np.sum(ratios) < 1, "sum of ratios must be strictly less than 1"
    assert len(ratios) >= 1, "at least one ratio required to split"
    
    def inner_split(X, y, ratios, acc_X, acc_y):
        ratio, *ratios_remaining = ratios
        X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=ratio)

        if len(ratios_remaining) == 0:
            acc_X.extend([X_train, X_test])
            acc_y.extend([y_train, y_test])
            acc_X.extend(acc_y)
            return tuple(acc_X)
        else:
            acc_X.append(X_train)
            acc_y.append(y_train)
            return inner_split(X_test, y_test, [r/(1.0 - ratio) for r in ratios_remaining], acc_X, acc_y)
    
    return inner_split(X, y, ratios, [], [])
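
Here is a tiny usage sketch of the function on hypothetical toy data (100 samples split into roughly 80/10/10):

In [ ]:
#Toy usage of split: 100 samples into roughly 80/10/10
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.arange(100)
X_a, X_b, X_c, y_a, y_b, y_c = split(X_toy, y_toy, [0.8, 0.1])
print(len(X_a), len(X_b), len(X_c))  # 80 10 10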

In [32]:
def predict(clf, X_train, y_train, X_test):
    """
    Fit the classifier on the training data and labels,
    then predict the labels of the test data.
    
    It also times the different steps.
    
    Parameters
    ----------
    clf: sklearn classifier
    classifier
    
    X_train: ndarray
    training data
    
    y_train: ndarray
    training labels
    
    X_test: ndarray
    testing data

    Returns
    -------
    out : ndarray
    Output the prediction of labels
    """
    start_time = time()
    print("Prediction computations started...")
    clf.fit(X_train, y_train)
    train_time = time() - start_time
    pred = clf.predict(X_test)
    prediction_time = time() - train_time - start_time
    
    print("...Finished")
    print("Training time = {}s".format(round(train_time)))
    print("Prediction time = {}s".format(round(prediction_time // 1)))
    return pred

In [33]:
def report(results, n_top=3, compared_to=10):
    """
    Print the parameters of the best grid search cross-validation results
    and plot their accuracy compared to another accuracy score. 
    
    Parameters
    ----------
    results: sklearn grid search cv_results_
    grid search cross-validation results
    
    n_top: int
    the number of best results to plot
    
    compared_to: int
    the nth best results to compare the best results with
    
    Returns
    -------
    out : None
    Output some prints and a plot
    """
    means = []
    stds = []
    for i in range(1, n_top + 1):
        candidates = np.flatnonzero(results['rank_test_score'] == i)
        for candidate in candidates:
            mean = results['mean_test_score'][candidate]
            std = results['std_test_score'][candidate]
            means.append(mean)
            stds.append(std)
            print("Model with rank: {}".format(i))
            print("Mean validation score: {0:.4f} (std: {1:.4f})".format(mean, std))
            print("Parameters: {}".format(results['params'][candidate]))
                        
    min_ = np.min(results['mean_test_score'][results['rank_test_score'] == (compared_to)])
    print('\n{0:}\'th score = {1:.4f}'.format(compared_to, min_))
    means = np.array(means) - min_
    
    plt.title("Top {0} best scores (compared to the {1}'th score = {2:.3f})".format(n_top, compared_to, min_))
    plt.bar(range(n_top), means, yerr=stds, align="center")
    plt.xticks(range(n_top), range(1, n_top + 1))
    plt.xlabel("n'th best scores")
    plt.ylabel("score - {}'th score".format(compared_to))
    plt.show()

Data splitting


In [34]:
ratios = [0.8, 0.1] #0.8 for train, 0.1 for test; the remaining 0.1 goes to validation

X_train, X_test, X_validation, \
y_train, y_test, y_validation = split(news_data, news_target, ratios)

2.2) Random Forest

Grid search for parameters tuning


In [35]:
# use a full grid over max_depth and n_estimators parameters
param_grid = {
    "max_depth": [3, 10, 20, None],
    "n_estimators": np.linspace(3, 200, num=5, dtype=int)
    #"max_features": [1, 3, 10],
    #"min_samples_split": [2, 3, 10],
    #"min_samples_leaf": [1, 3, 10],
    #"bootstrap": [True, False],
    #"criterion": ["gini", "entropy"]
}

# run grid search
grid_search = GridSearchCV(RandomForestClassifier(), param_grid=param_grid)
grid_search.fit(X_validation, y_validation)
None #Suppress the output of this cell

After having evaluated the model with many different parameter combinations, we choose the best parameters by comparing their mean cross-validation scores and standard deviations.


In [36]:
report(grid_search.cv_results_, n_top=5, compared_to=10)


Model with rank: 1
Mean validation score: 0.7374 (std: 0.0132)
Parameters: {'max_depth': None, 'n_estimators': 200}
Model with rank: 2
Mean validation score: 0.7310 (std: 0.0067)
Parameters: {'max_depth': None, 'n_estimators': 150}
Model with rank: 3
Mean validation score: 0.7141 (std: 0.0101)
Parameters: {'max_depth': None, 'n_estimators': 101}
Model with rank: 4
Mean validation score: 0.6806 (std: 0.0151)
Parameters: {'max_depth': None, 'n_estimators': 52}
Model with rank: 5
Mean validation score: 0.6775 (std: 0.0034)
Parameters: {'max_depth': 20, 'n_estimators': 200}

10'th score = 0.6037

Let's save the parameters that give the best result in a variable:


In [37]:
rank_chosen = 1 #Position of the parameters we choose
best_params = grid_search.cv_results_['params'][np.flatnonzero(grid_search.cv_results_['rank_test_score'] == rank_chosen)[0]]
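
Note that for the first rank specifically, scikit-learn already exposes this value as grid_search.best_params_; the lookup above is only needed when rank_chosen differs from 1 (a small equivalence check, added for illustration):

In [ ]:
#Equivalent shortcut, valid only when rank_chosen == 1
assert grid_search.best_params_ == best_params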

Random forest classification

We reuse the optimal parameters computed above to produce predictions with a random forest classifier.


In [38]:
random_forest_clf = RandomForestClassifier(**best_params)

pred = predict(random_forest_clf, X_train, y_train, X_test)


Prediction computations started...
...Finished
Training time = 166s
Prediction time = 0s

In [39]:
#Choose the average type
average_type = "weighted"

#Get the different scores of the prediction computed above
accuracy = metrics.accuracy_score(y_test, pred)
precision = metrics.precision_score(y_test, pred, average=average_type)
recall = metrics.recall_score(y_test, pred, average=average_type)
f1_score = metrics.f1_score(y_test, pred, average=average_type)

print("accuracy  = {:.4f}".format(accuracy))
print("precision = {:.4f}".format(precision))
print("recall    = {:.4f}".format(recall))
print("f1_score  = {:.4f}".format(f1_score))


accuracy  = 0.8568
precision = 0.8615
recall    = 0.8568
f1_score  = 0.8535

As one can see, neither precision, recall nor f1_score adds much information beyond the accuracy. This is because there are quite a lot of classes (20), which are roughly uniformly distributed:
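
A tiny toy illustration of this (hypothetical labels, unrelated to the news data): with weighted averaging over roughly balanced classes, recall coincides with accuracy, and precision/f1 stay close to it.

In [ ]:
#Toy labels: weighted recall equals accuracy, precision/f1 stay close
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]
print(metrics.accuracy_score(y_true_toy, y_pred_toy))                    # 0.67
print(metrics.recall_score(y_true_toy, y_pred_toy, average="weighted"))  # 0.67
print(metrics.f1_score(y_true_toy, y_pred_toy, average="weighted"))      # ~0.66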


In [40]:
classes = range(len(news_target_names))
def sum_by_class(arr):
    return np.array([np.sum(arr == i) for i in classes])

test_sum_by_class = sum_by_class(y_test)
val_sum_by_class = sum_by_class(y_validation)
train_sum_by_class = sum_by_class(y_train)

p1 = plt.bar(classes, test_sum_by_class)
p2 = plt.bar(classes, val_sum_by_class, bottom=test_sum_by_class)
p3 = plt.bar(classes, train_sum_by_class, bottom=test_sum_by_class + val_sum_by_class)

plt.xticks(classes, news_target_names, rotation='vertical')
plt.tick_params(axis='x', labelsize=15)
plt.legend((p1[0], p2[0], p3[0]), ('test', 'validation', 'train'))

plt.show()


The plot above shows that every class is well represented in the test, training and validation sets.

Confusion matrix

Let's show the confusion matrix


In [41]:
import itertools

# A function to plot the confusion matrix, taken from http://scikit-learn.org/stable/auto_examples/model_selection/plot_confusion_matrix.html#sphx-glr-auto-examples-model-selection-plot-confusion-matrix-py
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [42]:
cnf_matrix = metrics.confusion_matrix(y_test, pred)
# Plot non-normalized confusion matrix
plt.figure(figsize=(25, 15))
plot_confusion_matrix(cnf_matrix, classes=news_target_names, title='Confusion matrix, without normalization')



In [43]:
# Plot normalized confusion matrix
plt.figure(figsize=(25, 15))
plot_confusion_matrix(cnf_matrix, classes=news_target_names, normalize=True, title='Normalized confusion matrix')


What the confusion matrices show is that we did a pretty good job at assigning the categories, except that we classified quite a lot of documents as religion.christian instead of religion.misc, which is understandable because both categories are closely related. Also, atheism is closely related to religion, hence the above-average confusion for this category, although it remains small. The last place where we could have done better is the technology topics (pc.hardware, mac.hardware, etc.), which are again very closely related to each other. Overall, our classifier categorizes news correctly, and when it does not, it usually picks a category closely related to the correct one.

feature_importances_ attribute

Let's see what information the feature_importances_ attribute can provide us


In [44]:
importances = random_forest_clf.feature_importances_
std = np.std([tree.feature_importances_ for tree in random_forest_clf.estimators_], axis=0)
#Sort the feature by importance
indices = np.argsort(importances)[::-1]

print("Total number of features = {}".format(len(indices)))


Total number of features = 173446

In [45]:
# Only most important ones (out of thousands)
num_best = 20
best_indices = indices[:num_best]
best_importances = importances[best_indices]
best_std = std[best_indices]

# Plot the feature importances
plt.figure()
plt.title("20 best feature importances")
plt.bar(range(num_best), best_importances, yerr=best_std, align="center")
plt.xticks(range(num_best), np.array(feature_names)[best_indices], rotation='vertical')
plt.tick_params(axis='x', labelsize=15)
plt.xlim([-1, num_best])
plt.xlabel("Feature name")
plt.ylabel("Importance")
plt.show()


What we see is that the important features are the ones that make it easy to exclude a document from some categories, because there is an extremely small chance these words appear in documents about those categories. For example, it is very unlikely that a post about religion mentions "sale", whereas it is much more likely for a post about technology.

The third feature, 'dod', might be surprising because it is not a word we read often. However, if we look it up on the web, we find that it refers to the 'Department of Defense', which is a term mostly found in posts about politics. This reinforces our insight about the importance of features.